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Abstract 

In this paper, we propose a new approach, called the LAGS, short 
for "least absulute gradient selector" , to this challenging yet interest- 
ing problem by mimicking the discrete selection process of Iq regular- 
ization in linear regression. To estimate (3 under the influence of noise, 
we consider, nevertheless, the following convex program 

1 v 
(3 = arg mm-\\X T (y - X (3)11! + \ n S2 My, X;n)\Pi\ 

i=i 

A n > controls the sparsity and Wi > dependent on y,X and n is 
the weights on different n is the sample size. Surprisingly, we shall 
show in the paper, both geometrically and analytically, that LAGS 
enjoys two attractive properties: (1) LAGS demonstrates discrete se- 
lection behavior and hard thresholding property as /o regularization by 
strategically chosen Wi, we call this property "pseudo-hard threshold- 
ing"; (2) Asymptotically, LAGS is consistent and capable of discover- 
ing the true model; nonasymptotically, LAGS is capable of identifying 
the sparsity in the model and the prediction error of the coefficients 
is bounded at the noise level up to a logarithmic factor — logp, where 
p is the number of predictors. 

Computationally, LAGS can be solved efficiently by convex pro- 
gram routines for its convexity or by simplex algorithm after recasting 
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it into a linear program. The numeric simulation shows that LAGS 
is superior compared to soft-thresholding methods in terms of mean 
squared error and parsimony of the model. 



1 Introduction 

One of the most widely used model in statistics is the linear regression. In 
many applications, scientists are interested in estimating a mean response X/3 
from the data y = (yi,y2, ■■■,y n )- The p— dimensional parameter of interest 
(3 are estimated from the linear model 



where a is the intercept, e the noise. A common assumption is that e is 
Gaussian with ej ~ A/(0, a 2 ), but this is not an essential requirement as our 
methods are applicable to other types of noise which have heavier tails. 

Scientists usually have no information of the underlying models, a large 
number of predictors are chosen in initial stage to attenuate the possible 
bias and variance as well as to enhance predictability. Regression proce- 
dures capable of identifying the explanatory variables (others are set to be or 
shrunken to 0) and obtain good estimates of the response play a pivotal role. 
Ideally, Best Subset regression such AIC [1J, C p [12J, BIC pE] and RIC [TO] 
achieve the trade-off between model complexity and goodness of fit. These 
estimators are essentially least square penalized by Iq norm of (3 with differ- 
ent control coefficients. A standard formulation of Iq penalized least square 
is 



A is the control coefficient. For the rest of the paper, we assume that columns 
of X are standardized and y is centered. Under this setting a = 0, hence we 
omit it. 

Iq penalized least square has the hard thresholding property that keeps 
the large coefficient intact while sets small ones to be zero. However, unfor- 
tunately, solving ([2]) is a discrete process which needs to enumerate all the 
possible subset of (3; the combinatorial nature of Iq norm limits the applica- 
tion of ([2]) when the number of predictors is large. As a compromise, convex 
relaxation to 1% norm such as the Lasso [H] is a widely used technique for 



y = a + X/3 + e 



(1) 



(a, (3) = arg min-||y - al - Xf3\\ 2 2 + A||/3|| 



(2) 
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simultaneously estimation and variable selection. The Lasso is 

/3 = argmini|| 2/ -X/3||2 + AH/311! (3) 

The h penalty shrinks (3 towards zero, and also sets many coefficients to be 
exactly zero. Thus, the Lasso often regards as the substitute of ([2]). 

The Lasso and the subsequently appeared methods, such as the LARS 
[SJ, elastic-net [20], adaptive Lasso [TH], Dantzig Selector [2] to name a few, 
closely relate to the soft thresholding in signal processing jl] in contrast to 
hard thresholding as in ([2]). Since the soft thresholding both shrinks and 
selects, it often results in a model more complicated than the true model in 
its effort to spread the penalty among the predictors. As a greedier attempt, 
SCAD [3 El E] and SparseNet [I3]penalize the loss function by non-convex 
penalties. Similar to Q3J) , f3 is estimated by 

1 P 
(3 = arg min-||y - X/3g + A^ P(|A|; A; 7) (4) 

i=l 

where P(|/3j|; A; 7) defines a family of penalty functions concave in /3, and A 
and 7 controls the sparsity and concavity. It has been shown that (HI) enjoys 
better variable selection properties compared to l\ relaxation; whereas, both 
algorithms cannot assure to find the global optimal. 

In this paper, we propose a new approach, called the LAGS, short for 
"least absolute gradient selector" , to this challenging yet interesting problem 
by mimicking the discrete selection process of Iq regularization. To estimate 
(3 under the influence of noise, we consider, nevertheless, the following convex 
program 

1 P 
(3 = arg min- 1| X T (y - X/3) ||i + A n ^ w t (y; X; n) \ ft| (5) 

i=l 

A n > controls the sparsity and Wi > dependent on y, X and n is the 
weights on different n is the sample size. Surprisingly, we shall show in the 
following sections, both geometrically and analytically, that (|5| demonstrates 
discrete selection behavior as Zo penalty and hard thresholding property by 
strategically chosen Wi, we call this property "pseudo-hard thresholding". The 
graphical comparison of hard thresholding and pseudo-hard thresholding is 
given in prostate cancer example in section 7. 
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The rest of the paper is organized as follows: section 2 presents the mo- 
tivation of LAGS and connects the ideas with previous work; section 3 es- 
tablishes the theorem regarding the properties of LAGS and highlights the 
"pseudo-hard thresholding"; section 4 shows the potential problems associ- 
ated with the Dantzig Selector and provides a neat way to choose wf, section 
5 are the proofs of the theorems; section 6 discusses the computational is- 
sue of how to solve LAGS; section 7 demonstrates it by numeric examples; 
discussion and future work are in section 8. 



2 The LAGS 

2.1 Insight from orthonormal case 

The properties of hard thresholding can be better understood when the design 
X is orthonormal, i.e., X T X = I. The solution for (l2|) is 



& fe =)W?l>A) (6) 



as a reference, the Lasso solution is 



/3; aSS ° = sign(/3J)(|/3J|-A) + (7) 

where (3° = X T y is the ordinary least square (OLS) estimate. Notice that 
the hard thresholding operator (|6j) is discontinuous with the jump at = 
A, while the soft thresholding operator d7]) is continuous. If we assume 
|0i I > -\Pp \ > \/3p +i\- > \/3p\, another property of Q is that any A G 
(I^poI' I^po+iI) wn ^ ee P ^ ne fi rs ^ P° coefficients- Based on this property, we 
define "pseudo-hard thresholding" as 

Definition 2.1. A penalized regression has "pseudo-hard thresholding" prop- 
erty if there exist some intervals of R + , such that changing sparsity parameter 
A in these intervals will keep the coefficients unaltered. 

In order to achieve pseudo-hard thresholding in convex world, consider 
function of 9 

f(Q- z ) = \ Z -0\ + l\Q\ ( 8 ) 



Minimizing f(8; z), the solution is 

z 7 < A 



7 >A ^ 
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Figure 1: Plot of f(9; z). The left panel has 7/A < 1; the right panel 7/A > 1. 
It is obvious that the minimum is obtained at x = 1 on the left, while x = 
on the right 




9 is a discontinuous function of 7, with the breaking point at 7 = A. Mo- 
tivated by this special property of (18]), we formulate LAGS as in As a 
matter of fact, the idea to minimize ||A T (y — A/3)|| is pioneered by [2] in 
Dantzig Selector 

/9 = arg min||/3|| a subject to \\X T (y - Xfi)^ < t (10) 

which can be written equivalently as 

P = arg min\\X T (y-Xp)\\ 00 subject to < t (11) 

One reason why ||A T (y — A/3)|| should be small given in [2] is that a good 
estimate of /3 should be independent of orthogonal transformations. Another 
more compelling yet insightful argument is that in OLS one needs to minimize 
foLS = l\\y-X/3\\l, its gradient is VJols = -X T (y-Xf3); solving VfoLS = 
results in OLS estimates, hence, one can expect that a good j3 should 
shrink V foLS towards 0. In order to incorporate ([8]), l\ is chosen here — from 
where the name "LAGS" comes; we will show in the following sections that 
this choice of norm can set some elements of the gradient to zero, which 
means unrestricted coefficients for them, like in Iq penalty. Moreover, if we 
instead consider the absolute deviance by substituting ||A T (?/ — X/3)\\i with 
\\y— X{3\\i, it is the LAD-Lasso [I?] ; but LAD-Lasso has no hard thresholding 
property even in orthonormal design case. 
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2.2 The weight W{ matters 

In the orthonormal case, ^ becomes 

fa = arg mm\(3° - fa\ + \Wi\fa\,i = h -,P ( 12 ) 

One heuristic to choose Wi is to consider the correlation q between y and 
Xj — the zth column of X: if |cj| is large, which means zth predictor may be 
a good explanatory variable, hence fa should be penalized less; otherwise, it 
should be penalized more. Thus, we set 

w i = r~\ ( 13 ) 

with a little abuse of notation, Wi = oo when Cj = 0. Without loss of 
generality, we assume that |ci| > ...|c Po | > ... > \c p \, which implies W\ < 
...w po < ... < w p . By choosing A, s.t. w~ 0+1 < A < w~^, we have fa = (3°, i = 

1, ...,Pq] fa — 0, i — po + 1, which is pseudo-hard thresholding and /3 is 
identical with ([6]). 

To push this heuristic further, we notice that q is the coefficient of OLS 
in the orthonormal design case. This suggests that we can choose w^ 1 as 
absolute value of OLS coefficients /3f, see section 4 for detail. 



3 Properties of LAGS 

In section 2, some properties of LAGS are demonstrated in the simplest 
case. These properties are not incidental. To formally state our results, 
we decompose the regression coefficient as (3 = ((3^,(3^), where (3^ = 
(/?!,..., (3 P0 ) corresponds to the true parameters and (3^ = (/3 P0 +i, (3p) are 
redundant; the columns of X are decomposed alike. Furthermore, define 



and 



a n = max {w™} (14) 

1<7<P0 



b n = min {<} (15) 

Po+l<J<P 



We assume three conditions: 
(a) y = x^ is> (3^ + e, where e is noise with mean and variance a 2 . 
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(b) c. 



1 



X T X C and Cr, 



11 ^12 
°21 u 22 

C n and C are positive definite matrices. 



c 



C21 C22 



where 



(c) 1 1 C^Ci2 1 1 00 < 1 - V and IKC^) -1 ^!^ < 1 - r] n , where 77 and rj n are 
positive constants. This condition is established in [19] and also called 
Irrespresentable Condition in [18]. 



C is actually the covariance matrix of predictors. Consider set fl = {s e R p : 
|| s || oo = 1, max{|si|, |s Po |} = 1}, which is closed and contains in the unit 
sphere under norm, hence Q is a compact set. Let us define 

7 = min{||[C 11 ,C 12 ]( S ( 1 \ S ( 2 )) T || 00 } (16) 

here we partition s as above. It is obvious that ||s^||oo = 1 and Hs^Hoo < 1> 
thus \\[I, C^^sW^ > Hs^lU - HCVM^IIoc > 1 - (1 - rj) = v- 7 > 
is trivial by noting that \\Cnv ||oo = v = 0. 



Theorem 3.1. We can find a, b dependent on C , typically, a = 7, b = 
such that if\im\ n a n < a and limA n 6 n > b, then LAGS is consistent and has 
pseudo-hard thresholding property. 



The proof is given in Section 5. Theorem 3A gives another hint that the 
weights will play an important role for the effectiveness of (JSJ). To clarify 
the pseudo-hard thresholding property, it is helpful to interpret LAGS geo- 
metrically. l(/3;\ n ) = ^\\X T (y-Xf3)\\ 1 + \ n Y J P i= iUJi(y;X;n)\f3 i \ is piecewise 
linear in (p + 1)— dimensional space. We assume there are no flat regions — in 
which l((3) are constant, then l(/3) must obtain its minimum on some break- 
ing point (3' otherwise there exist descent directions (consider the simplex 
algorithm). When n is fixed, as we change X n with a tiny amount, the value 
in each breaking point may change, but it is possible that no values on other 
breaking points catch up l((3') after the change; if this is the case, /3' will 
still be the minimizer even though the A n is different. To illustrate this point 
graphically, let us consider a toy 1-dimensional example y = (7, 2,4, 2) T , 
X = (2,3,5, 7f 

l{(3) = |7 - 20| + |2 - 30| + |4 - 5/3| + \2 - 7(3\ + A|0| 

When A = 1,2,5, l(/3; 1) and Z(/3; 2) are both minimized at /3 — 2/3, while 
Z(0; 5) at = 2/7 as shown in Figure [2j It implies even though A increases 
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Figure 2: Graphical illustration of pseudo-hard thresholding and discrete 
selection process in one dimension. The blue line is A = 1; the green line 
A = 2; the red line A = 5. The optimal /3s are indicated by the dotted vertical 
line. 
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from 1 to 2, the values at other breaking points do not catch up 1(2/3), but 
when increases to 5, 1(2/3) is caught up by 1(2/7). The consistency can be 
understood in a similar fashion: as n increases, the number of breaking points 
increases exponentially, which provides more candidates to solve (]5]). As a 
consequence, the probability to discover the true model increases. 

The discrete selection nature of LAGS can also be understood under 
this framework: only moves from one breaking point to another and the 
breaking points are scatterred in p— dimensional space, which implies $ is 
selected discretely. Fortunately, this feat of LAGS can be achieved by solving 
a tractable convex program rather than enumerating all the possible breaking 
points as l regularized regression. 



4 How to choose weights 
4.1 Dantzig Selector 

The performance of Dantzig Selector (DS) and its similarities with Lasso and 
LARS are discussed in [6]. In their several numerical studies, the coefficient 
profiles of DS seem to be wilder and have some erratic fluctuations; the 
prediction accuracy is also inferior. DS can be re-expressed in penalized 
form as 

= arg mm\\X T (y - X T (3)\\oo + (17) 
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We denote the penalized form as Ids{P), it is also a piecewise linear function. 
We argue that one possible reason for these properties in |6] is that the 
penalties on the coefficients are uniform. 

Example. If l DS (P) = max{|l - |2 - f3 2 \} + A(|/?i| + \/3 2 \) by carefully 
chosen y and X. As discussed in Section 3, the piecewise linear function will 
obtain its minimum at one of its breaking points. In this example, there are 
total four such points: (0, 0), (0, 2), (1, 0), (1, 2), thus 

min Ids(P) = min{2, 1 + 2A, 2 + A, 3A} 

f 2 if A < 2/3, p= (0,0) 
min/^5 = < 2 if A = 2/3, there are infinit /3s 
{ 3A if A > 2/3, p= (1,2) 

So DS will threshold both (3i and (3 2 or keep both intact unless A = 2/3, where 
the solution is not unique. LAGS has the similar behavior if the weights 



are uniform, i.e., if we take XwiS are equal in (12). However, Theorem 3.1 



requires the weights converge to different values for true and noisy predictors; 
hence, we conjecture that if we choose the weights for DS as with LAGS, the 
counterintuitive behaviors can be eschewed in some extend. We show that 
numerically in section 8. 



4.2 How to choose W{ 

Imposing different weights on the coefficients to enhance predicability is dis- 
cussed in [T5] and [17] . Both suggest to use the inverse of ordinary least 
square estimate as the weight. In the orthonormal case, the adaptive lasso 
estimates for 9 are obtained by 

P? aptme = argmin \{p> - (3f + A-^|0| 

where j = 1, 2, ...,p. Therefore, ^ dapttve = sign(/3°)(|/3°| — p|j^) + . Compared 

with Lasso solution (|7j), adaptive lasso shrinks (3° towards less for larger 
ft?; as a result, it is shown that it introduces less bias than Lasso. The 
weight derived in LAD-Lasso [17J is based on the Bayesian perspective: if 
each coefficient is double-exponentially distributed with location and scale 
Aj, then the log-likelihood of the posterior is: 

n p 

^log/(ej) + >^ \j\f3j\ - log(Ai) + constant 

1=1 8=1 
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minimize it with respect to Aj, which leads to Aj = 1/|A|. However, we 
do not have the oracle to know in advance; hence, at first step, we need 
a coarse estimate of a natural choice will be the Pols- Surprisingly, in 
what follows in this section, it is shown that this is also a good choice for the 
weights in LAGS. 

Theorem 4.1. If we choose vj\ = l/\/3oL,s\i> then asymptotically, conditions 



for a n and b n in Theorem 3.1 is satisfied, where Pols — {X T X) 1 X T y. 



What if p > n which is not addressed in (TTJ [19] ? The number of variables 
p is larger than the sample size n frequently arises in applications. The ordi- 
nary least square fails because the solution of it is not unique. Nevertheless, 
ridge regression is a shrinkage estimator that can enhance the predictability 
of the model, so we use ridge solution as the weights for LAGS; moreover, 
the ridge regression solution can be obtained by simply solving the similar 
equation, namely f3 r id ge = (X T X + (fil)~ 1 y, where is some positive constant; 
then Wi = l/\/3 rid9e \i. 

There are two nice properties associated with our choice of wf 

(1) Without loss of generality, let us consider the scenario that there are 
several groups of identical predictors, the ridge estimate of (3 will always 
put equal weights on the identical predictors. The reason is that the 
optimization problem 

mina^ + a\ + ... + ct\ subject to a\ + «2 + ••• + otk = r\ 

will obtain its optimal if and only if <x\ — ct2 — ... = = r]/k. This is 
desirable since we do not bias towards any predictors. 

(2) LAGS is a shrinkage estimator in the sense that 

~[ \POLS\i 

It is easily followed by the fact that 10; A) < 1({3ols] A) and X T (y — 
XPols) = 0, thus 



V^-X/3)|| 1 + AX:^<A P 

n ~i \POLS\i 
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Suppose now that e ~ A/"(0, a 2 ), our next result is non-asymptotic, which 
claims that under the choice Wi = l/\/3oLs\i, LAGS can accurately identify 
the underlying model and estimate the coefficients. 

Theorem 4.2. Under the choice of Wi = 1/\Pols\% and suppose that 



min > c\/2\ogpa (18) 

ie{i,2,..., Po } 



K '" I|: "<M (19) 



7n 

where 7„ = min sgn {|| [C{\, C" 2 ](si, s 2 ) T ||oo} as wi @)- If £ > satisfies 
(c — £)/£ > M, then we can choose 

Z^logpaWCnWoo < A < (c-O^logptrTn (20) 

A = 0,i = po + l,..-,P (21) 

A = (^OLs)i,« = l,-,Po (22) 

||/3 < 2e-Po-logp-a 2 (23) 

are satisfied with probability at least (1 — n~ 1 l 2 £ i ~ 1 (n\ogp)~ 1 l 2 Kp~ n£ ? I , 
where k is a constant dependent on Cn 1 ^ 2 ■ 

In words, the nonzero coefficients should significantly stand above the 



noise as indicated by (18) and ||C n || 00 /7 n is uniformly bounded by M as 



in (19), £ is chosen to make the set (20) nonempty. If all these conditions 



are satisfied, LAGS identifies the correct variables and only these with large 
probability. Moreover, the coefficients of identified variables are set to be 
the OLS estimate, which is analogous to hard thresholding. The accuracy of 



LAGS is quantified by (23), the mean squared error is proportional to the 
true number of variables times the variance of the noise with the logarithmic 
factor — logp, which is unavoidable since we do not have the oracle to know 
the set of true predictors in advance [31 E] . 
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5 Proofs of Theorems 



5.1 Proof of Theorem 13.1 



Proof. LAGS is a convex program, thus it obtains its minimum at some $. 
By the theory of convex optimization, the subgradient at /3 is zero 



V/(/3) = - ^-^sign(X T y - X T XP) 



n 



+ A„Oisign(/3i), w Po sign(/3p ),Wp 0+ isign(/3 ; 



'po+D, •••) 



w 



p sign0 p )) rj 



(24) 



where sign(x) = A for x ^ and sign(x) G [—1, 1] for x = 0. We take 

b = || C || oo • Since b = limA n 6 n > ||C||oo, for any 5 G (0,(6 — b)), there 
exits N%(5) and n > Ni(S) such that ||C||oo + < A n 6 n . Similarly, since 
C n — )■ C as in condition (b), there exits N 2 (5) and n > N 2 (5) such that 



The optimal condition (24) implies 



sign(X J y - X 1 X/3) = A n ^sign(A) 



(25) 



Notice that when po + 1 < i < p, 



A„6 n < A n Wj 



and 



Hence 



X T X 



n 



sign(X T y - X T Xfi) 



< 



,x T x, 



n 



A n 6 n |sign(/3i)| < 



,X T X, 



n 



(26) 



for n > N(5) = max{Ni(S), N 2 (5)}, we have |sign(/3j)| < 1, which implies 
A = 0. 

Analogously, we take a = 7, where 7 is defined in Section 3; when 1 < 
i < Po and since a = limA n a n < a, for any 5' G (0, (a — a)), there exists iV(5') 
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and n > N(S') such that min se Q{|| Cf 2 ]s||oo} > 7 ~ 6' an d A n a n < 7 — 5'. 
Choose n > N = max{N(6), N(6')}, we have /3 (2) = 0, then 



sign(X J ?/ - X J X/3) = sign I I x(2) 



sign 



X( 2 ) T X«(/§W-/3«) 



(27) 



We claim that X^e - X^X^^ - = 0. 
If it is not true, then 

s' = sign(X T e - X T X0 - /?)) e 



which upon combining with ( 25 ) gives 



1-6' < ||[Cn,Ci2]s'||oo < Ka r . 



(28) 



contradicts with our choice of n. 

We impose the superscript n on /3 to make it explicitly dependent on n. 
Therefore, for n > N, 

^\\X^ T e-XW T X^0 nW -(3^=0 
n 

Combining E(e) = and the law of large numbers, 



X^\ 



->■ 



n 



by condition (b) 



hence, 



x m T xw 



n 



(3 n(1) -> /3 



-►Cu 



(i) 



which implies LAGS is consistent. 

The pseudo-hard thresholding property holds for n > N as well. In fact, 
as shown above, if inequalities (26) and (28) are satisfied, then we have 



= (x^x^)- 1 *^ 



(29) 
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and 

/3 (2) = (30) 

Therefore, for any sequence {\ n Wi, A n w 2 , X n w p }, if \ n a n < a — 5' and 

^rfin > b + 5, then the LAGS solutions are the same. □ 



5.2 Proof of Theorem 14.1 

Proof. The normal equation is 

X T Xf3 = X T y (31) 
where y = X^(3^ + e. By condition (a), (b), 

-X T X -> C, -X T e -> 0, -X T X«/3« -> ( ° u \ 
n n n \ G21 / 



n 

so 



/ 3(D_ >/3 (i) )/ 3(2)^^(2) = 

Hence, when n is large enough, we can choose A„, such that \ n /\fioLs\i < 7 
for i = 1,2, ...,p and A n /|/3 Ls|i > ||C||oo for i=p + l, ...,p. □ 



5.3 Proof of Theorem 4.2 



In order to prove this theorem, we need the following lemmas. 

Lemma 5.1. If z ~ A/"(0, E), 2 G 7?. p and HE 1 / 2 ^ < c _1 , then the probability 

w/jere 0(t) = (2^)-^ exp(-t 2 /2). 

Proof. Suppose n ~ A/"(0, /) and 2 = E^n, 

L.^ (27r)i|E|V2 exp(-^E- l z)cfa (32) 

= / (0 s p/2 exp(--v T v)dv (33) 

> fi-MMy (35 ) 



cf 
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Here we apply the fact that f°° <fi(t)dt < <fi(t)/t for t > 0. 



□ 



Lemma 5.2. Suppose ||Cn 1 ^ 2 ]|oo < k, then \PoLs\i > (c — ^)\^2Aogpa fori = 
1, ...,po and \poLs\i < £v / 2Tog~pcr /or z = po + 1, are satisfied with prob- 
ability at least ( 1 - m^EMA XP 



£ i/ 2 " logp/re 

Proo/. Since y = + e 



1 - Tr-^-i^jogp)-!^-^ 2 /- 2 )*'. 







+ 



x« T xw x« T x( 2 > 

J(2) T J(1) ^(2) T X (2) 



X^ 
X( 2 ) T f 



(36) 
(37) 
(38) 



Denote ( 



x« T xw x« T x( 2 ) 
x( 2 ) T x^) x( 2 ) T x( 2 ) 



xw T f 



which is a Gaussian 



random vec tor with mean and variance C n 1 a 2 /n. Hence by applying 
and noting ||Cn 1 ^ 2 a/ v /n|| 00 < Ka/yjn 



Lemma 



5.1 



wiiaii ^ t /oi ^ ^ 20(^v/2nlogp//t) 

V 4 v in log p/ k 



□ 



Theorem |4.2| is a consequence of the two lemmas. 
Proof. Since X T (y — X(3 OLS ) = 0, it implies 

1 p 1 p 

m = -\\x T y-x T xf5\\ l +\y^ w m = -\\x T x(/3 OLS -m+^wM 



1=1 



1=1 



Choosing A satisfying (20) and applying Lemma 5.1 and Lemma 5.2 we 
take the subgradient of l((3) and use the same arguments as in the proof of 
Theorem 3.1 With large probability, we have 

= o,i = p + 1, -,p 

and 



XW T XW,X« T X( 2 ) - /&) = 



15 



which implies 
Hence 



h = {PoLs)i,i = 1, -,Po 



by IKIloo < i^f^pa. 



(i) 

OLS \ _ 



/3« 

J \ 

2 





< 2£ 2 -po ■ logp- a 2 



(39) 

(40) 
(41) 
□ 



6 Computation and Implementation 

LAGS is a convex program, thus global minimum is assured. There are 
a lot of algorithms available to solve (J5J), such as the subgradient method, 
interior-point methods. However, like Dantzig Selector |2j, LAGS can also 
be reformulated linear program. 

Denote \{X T (y - X0))i\ = u h i = l,...,p; |A| = v h i = I,..., p. Then 
solving LAGS is equivalent to solve the following linear program with in- 
equality constraints 



min } Ui 

u,v,0 ^— ' 
1=1 



WiVi 



subject to 



-u 



<X T (y-X(3)< 
-v < (3 < v 



u 



(42) 



(43) 
(44) 



This linear program has 3p unknowns and 4p constraints. When p is relatively 
small, e.g., less than 100, solving linear program is more efficient; when p gets 
large, solving the convex program directly is recommended. 

If the computing environment contains the routine of solving regression 
under least absolute deviance criterion (LAD). LAGS can also be passed into 
the routine by treating an augmented samples and responses. Let us define 
(y*,X*), where (y*,x*) = {{X T y) h {X T X) t ) for 1 < i < p; (y* p+i , x* p+i ) = 
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(0, Xw^ej) for 1 < % < p, is the jth row of identity matrix. Then it can be 
verified that 

(3 = arg min||j/* — 

So LAGS can be solved without much programming effort. 

In most applications, we need to run a sequence of sparsity parameter A 
and choose the optimal one based on some criteria such as cross-validation. 
An efficient way to accomplish this is to pass the previous solution as the 
"warm start" for the new value of A. The justification of this technique 
is that in simplex algorithm, each iteration tries to find a better candidate 
solution at the vertices adjacent to the current one, if the difference between 
As are small, the new minimum should be close to the previous one, hence 
the new solution can be found in a few iterations. 

7 Numerical Simulation and Example 

7.1 Prostate Cancer Data 

For the sake of illustrating the Pseudo-Hard Thresholding property, we study 
the simple yet popular example — Prostate Cancer data [T5], [16] . The response — 
logarithm of prostate-specific antigen (Ipsa) is regressed on log(cancer vol- 
ume) (lcavol), log(prostate weight) (lweight), age, the logarithm of the amount 
of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log(capsular 
penetration) (lcp), Gleason score (gleason) and percentage Gleason score 4 
or 5(pgg45). 

The data set is divided into two parts: a training set of 67 observations 
and a test set of 37 observations. It clearly shows in Figure [3] that the coeffi- 
cients profiles of LAGS and / penalty are quite similar. As A increases, the 
predictors are excluded from the model with the same order at almost the 
same As. The discrete selection processes of LAGS and Iq penalty are indi- 
cated by the jumps and constant segments in the profiles: the jumps means 
the predictors are either included in the model or not; while the constant seg- 
ments means the coefficients of the included predictors are unchanged even 
though A increases. Another interesting observation of the profiles is that 
the roles of some predictors are downplayed when there are many predic- 
tors included, which are mainly caused by the high correlation among them; 
however, as the included predictors are fewer, their roles are shown up and 
even increase — refer to the brown line in Figure [3j coefficient of lcavol. The 
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right panel is the prediction errors of LAGS and Iq penalty on the test set 
respectively, they are also piecewise constant functions of A. 




Figure 3: Coefficients profiles as a function of sparsity parameter A. There 
are eight predictors, all the predictors are first centered and standardized 
before passing to the solvers, then the outputs of solvers are transformed 
back to the original scale. The right panel is the prediction error for the test 
data. 

In contrast, the continuous shrinkage property of Lasso are demonstrated 
in Figure |4} The general trend of the coefficients is decreasing as A increases 
(this is not true in general, there exist examples that some coefficients in- 
crease even though A increases). The prediction error is also larger than that 
of LAGS and Iq penalty; but care should be taken that this argument can 
not be generalized; on the contrary, the discrete selection process usually 
exhibits higher variability hence higher prediction error. 

7.2 Diabetes Data 

The diabetes data are studied in There are total 442 patients (samples) 

and 10 predictors in the model. We fit a linear model on this data set. Since 
LAGS tries to minimize the gradient directly by the /i, it is insightful to 
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compare the gradients between LAGS and Lassdj We have both computed 
the LAGS and Lasso solution with 5- fold cross validation as in Table [TJ since 
LAGS sometimes demonstrates a little higher variability, we pick the most 
parsimonious model within 45%-50% standard error of the minimum instead 
of the "one-standard error" rule. LAGS has 4 nonzero coefficients while 
Lasso has 5. Moreover, we see in the table that the absolute inner product of 
predictors with the residue of LAGS is very sparse by its effort to minimizing 
the gradient directly, whereas that of Lasso is much denser and satisfies 

Xj{y - Xf3 lasso ) = A • sign(/3f sso ), V (3\ asso ± 

and 

\Xf{y - X(3 lasso )\ < A, V P\ ass ° = 

It is also notable that the magnitudes of the nonzero coefficients of LAGS 
are larger than that of Lasso. The reason is that Lasso both shrinks and 
selects, it tries to relax the penalty on relevant coefficients, as a consequence, 
some important predictors are downplayed by sharing their weights to others 
and selected model is relatively dense. This argument can be further verified 
by comparing the ||/3' asso ||i « 1335 and ||/3 ia9S ||i « 1617. Table § 

summarizes the correlations between the predictors and residue, which shows 
that LAGS enables some predictors to be exactly orthogonal to the residue. 
This simple data set shows the superiority of LAGS over Lasso. 

1 Lasso solution is computed by R package glmnet [TT] 



Lasso 

1 i 1 , 1 1 1.4 




'0 5 10 15 20 '0 5 10 15 20 

Figure 4: Coefficients profiles as a function of sparsity parameter A by the 
Lasso. 
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LAGS Lasso 



Variable j 


Xj(y-XP) 


Pi 


XT(y-XP) 


Pi 


1 


-27.9927 


0.0000 


14.5431 


0.0000 


2 


-134.9231 


0.0000 


-111.7310 


-33.3383 


3 


0.0000 


604.7797 


111.7310 


508.1903 


4 


0.0000 


268.1098 


111.7310 


210.3536 


5 


-53.2906 


-133.8965 


-55.5267 


0.0000 


6 


0.0000 


0.0000 


-54.0219 


0.0000 


7 


119.4116 


0.0000 


111.7310 


138.8478 


8 


73.3477 


0.0000 


66.2507 


0.0000 


9 


0.0000 


609.8394 


111.7310 


444.5615 


10 


39.7171 


0.0000 


101.9332 


0.0000 


Mean Squared Error 


3021 




3044 





Table 1: The gradient Xj(y — X(3) and coefficient (3j in each coordinate on 
the diabetes data with 10 predictors by LAGS and Lasso respectively. 





correlations between the predictors and residue 


LAGS 
Lasso 


-0.02 -0.12 0.00 0.00 -0.05 0.00 0.10 0.06 0.00 0.03 
0.01 -0.10 0.10 0.10 -0.05 -0.05 0.10 0.06 0.10 0.09 



Table 2: The correlations with the residue: Xj ■ res / 1 1 J\Tj 1 1 2 1 1 res 1 1 2 
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7.3 Simulated Data 



We compare the simulation performance of LAGS and Sparsenetj^] [13J and 
Lasso [16] with regard to training error, prediction error and number of non- 
zero coefficients in the model. We assume that the predictors and errors are 
Gaussian distributed. If X ~ Af(0, S) and e ~ Af(0, a 2 ), then the Signal-to- 
Noise Ratio (SNR) is defined as 

a 

We take E = S(p) G R pxp with l's on the diagonal and p on the off-diagonal. 
We generate two data sets with SNR = 2, p = 0.2 and SNR = 3, p = 0.4 
respectively, the sample sizes n are both 2000 and = (30, 29, 28, 1, 0970), 
p = 1000. In order to evaluate the performances of these three algorithms 
when p ^> n, we split the two data sets into training set with 500 samples 
and testing set with 1500 samples. Before passing the training set into the 
three algorithms, we first standardize the predictors, then choose the sparsity 
parameter A by 10-fold cross-validation; since p ^> n, the weights for LAGS 
are set to be the inverse of ridge estimate with <p = 0.2 (since each predictor 
is standardized, the diagonals of X T X are Is, <fi is chosen quite arbitrarily 
from (0,1)). The results are summarized in Table [3j we can observe that 
Lasso solution is overly dense, whereas Sparsenet solution is overly sparse; 
LAGS stays between the two and is closer to the true model. In Table |4| the 
data sets are split into training set with 1500 samples and testing set with 
500 samples, the weights are inverse of OLS estimates, A is again chosen by 
10-fold cross-validation. In both simulations, the prediction errors of LAGS 
are slightly larger than that of Sparsent because of the variability caused 
by discreteness; however, LAGS tends to discover the true models while 
Sparsenet tends to discover the sparser ones. 

8 Discussion 

8.1 Adapted version of Dantzig Selector 

In section 4, we argue that one possible explanation of DS in [6] is the 
uniformity of the weight put on each estimator. We adopt the choice of 

2 The data sets and Sparsenet solution are computed by R package sparsenet [13] 
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LAGS 


Sparsenet 


Lasso 


p = 0.2 
SNR = 
2 


# nonzeros: 33 
Training Error: 1915.4 
Testing Error: 2258.3 


# nonzeros: 20 
Training Error: 1874.3 
Testing Error: 2228.3 


# nonzeros: 92 
Training Error: 1624.3 
Testing Error: 2505.5 


p = 0.4 
SNR = 
3 


# nonzeros: 30 
Training Error: 710.3 
Testing Error: 785.5 


# nonzeros: 22 
Training Error: 714.2 
Testing Error: 772.0 


# nonzeros: 112 
Training Error: 584.1 
Testing Error: 938.8 



Table 3: n = 500, p = 1000, # test set = 1500. The number of true predictors 
is 30. 





LAGS 


Sparsenet 


Lasso 


p = 0.2 
SNR = 
2 


# nonzeros: 29 
Training Error: 1900.7 
Testing Error: 2164.9 


# nonzeros: 26 
Training Error: 1854.1 
Testing Error: 2064.4 


# nonzeros: 97 
Training Error: 1811.9 
Testing Error: 2274.6 


p = 0.4 
SNR = 
3 


# nonzeros: 29 
Training Error: 654.3 
Testing Error: 683.2 


# nonzeros: 26 
Training Error: 594.1 
Testing Error: 682.4 


7^ nonzeros: 122 
Training Error: 607.5 
Testing Error: 760.7 



Table 4: n = 1500, p = 1000, # test set = 500. The number of true predictors 
is 30. 
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weights as with LAGS. Thus, we need to solve the weighted version of DS 
j3 = arg min \\X T (y - Xf3)\\ OQ subject to V Ta i - t 

~[ \POLS\i 

We test our idea on the extended diabetes data where the interactions are 
included. The left panel of Figure [5] is the weighted version, while the right 
panel is the original version which is also appeared in [6] . The erratic behavior 
of DS is relatively mitigated with the weighted DS. 



8.2 Contributions in this paper 

In this paper, we propose a new algorithm for linear regression — LAGS and 
introduce the "pseudo-hard thresholding" properties which mimics the Iq 
regularization. Under mild conditions, we have proved that asymptotically, 
LAGS is consistent for model selection and parameter estimation. Since the 
strength of the variable selection algorithms lies in its finite sample perfor- 
mance, we have also established the nonasymptotic theorem which shows 
that with large probability, LAGS can discover the true model and the error 
of the estimated parameters is controlled under the noise level. In the proofs 
of these theorems, we emphasize that the weights on the parameters play a 
critical role for the effectiveness of LAGS. 

LAGS is a re-weighted regularization method which is first discussed in 
adaptive Lasso [19], it can be also interpreted as a multistage procedure: 
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first, we provide a very coarse estimate of the parameters of interest; second, 
based on this estimate, we are able to seek much better ones. Letting the 
regularization part depend on the data set makes the theories much more 
difficult; however, it usually results in better performance. 

The subject of variable selection in linear models has large bodies of 
literature. The efforts are mainly divided into two streams: on the one hand, 
the discrete selection procedures of Iq penalty methods such as AIC, BIC are 
shown to enjoy many nice properties, but they are highly impractical; on the 
other hand, the continuous shrinkage algorithms such as Lasso and LARS are 
computationally favorable but they do not have hard-thresholding property 
and the bias introduced by them is significant sometimes. Our work bridges 
the gap by pioneering the discrete selection process in the convex world. 
The attractive properties of LAGS indicate that in some applications, LAGS 
can be served as a surrogate for l Q penalty and an improved version of the 
continuous shrinkage methods. 

There are a group of algorithms on the shelf to solve LAGS. However, 
since LAGS needs to compute a sequence of solutions like Lasso and Sparsenet, 
one possible future work is to develop efficient path algorithms such as glmnet 
and sparsenet. 
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